Specifying the question

To identify individuals likely to click on ads

Metrics of success

The analysis will be successful if I can find individuals likely to click on the ads

Context

A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog.
She currently targets audiences originating from various countries. In the past, she ran ads to advertise a related course on the same blog and collected data in the process.
She has employed a Data Science Consultant to help her identify which individuals are most likely to click on her ads.

Importing the dataset and libraries

library('data.table')
library('tidyverse')
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()   masks data.table::between()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::first()     masks data.table::first()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::last()      masks data.table::last()
✖ purrr::transpose() masks data.table::transpose()
library('chron') # For working with datetime

advertising <- fread('http://bit.ly/IPAdvertisingData')
Column descriptions

‘Daily Time Spent on Site’: consumer time on site in minutes

‘Age’: customer age in years

‘Area Income’: Avg. Income of geographical area of consumer

‘Daily Internet Usage’: Avg. minutes a day consumer is on the internet

‘Ad Topic Line’: Headline of the advertisement

‘City’: City of consumer

‘Male’: Whether or not consumer was male

‘Country’: Country of consumer

‘Timestamp’: Time at which consumer clicked on Ad or closed window

‘Clicked on Ad’: 0 or 1, indicating whether the consumer clicked on the ad

Obtained from a Kaggle discussion

Since the timestamps record when a consumer was leaving or entering the site, a ‘Clicked on Ad’ value of 0 implies they were leaving the site, while 1 implies they were entering it.

previewing the top of the dataset

head(advertising)

Cleaning the dataset

checking the datatypes of the columns in the dataset

str(advertising)
Classes 'data.table' and 'data.frame':  1000 obs. of  10 variables:
 $ Daily Time Spent on Site: num  69 80.2 69.5 74.2 68.4 ...
 $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
 $ Area Income             : num  61834 68442 59786 54806 73890 ...
 $ Daily Internet Usage    : num  256 194 236 246 226 ...
 $ Ad Topic Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
 $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
 $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
 $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
 $ Timestamp               : chr  "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
 $ Clicked on Ad           : int  0 0 0 0 0 0 0 1 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Summary of the dataset

summary(advertising)
 Daily Time Spent on Site      Age         Area Income    Daily Internet Usage
 Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
 1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
 Median :68.22            Median :35.00   Median :57012   Median :183.1       
 Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
 3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
 Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
 Ad Topic Line          City                Male         Country         
 Length:1000        Length:1000        Min.   :0.000   Length:1000       
 Class :character   Class :character   1st Qu.:0.000   Class :character  
 Mode  :character   Mode  :character   Median :0.000   Mode  :character  
                                       Mean   :0.481                     
                                       3rd Qu.:1.000                     
                                       Max.   :1.000                     
  Timestamp         Clicked on Ad
 Length:1000        Min.   :0.0  
 Class :character   1st Qu.:0.0  
 Mode  :character   Median :0.5  
                    Mean   :0.5  
                    3rd Qu.:1.0  
                    Max.   :1.0  

Checking for null values

colSums(is.na(advertising))
Daily Time Spent on Site                      Age              Area Income 
                       0                        0                        0 
    Daily Internet Usage            Ad Topic Line                     City 
                       0                        0                        0 
                    Male                  Country                Timestamp 
                       0                        0                        0 
           Clicked on Ad 
                       0 

There were no null values in the dataset

Checking for Duplicates

dim(advertising[duplicated(advertising)])
[1]  0 10

There were no duplicates found

Splitting the timestamp column into date and time

time_stamp <- advertising$Timestamp
parts <- t(as.data.frame(strsplit(time_stamp,' ')))

advertising$dates <- as.Date(parts[,1]) #saving dates
advertising$times <- as.times(parts[,2])#saving time
# view(advertising)
age <- advertising$Age
area_income <- advertising$`Area Income`
time_on_site <- advertising$`Daily Time Spent on Site`
internet_usage <- advertising$`Daily Internet Usage`
gender <- as.character(advertising$Male)
ad <- as.character(advertising$`Clicked on Ad`)

date <- advertising$dates
time <- advertising$times
country <- advertising$Country
city <- advertising$City
ad_topic <- advertising$`Ad Topic Line`
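
As an alternative to splitting the string by hand, the timestamp can be parsed in one step with base R. This is only a sketch: the three timestamps are copied from the str() output above, the names ts, visit_date and visit_hour are illustrative, and the UTC time zone is an assumption because the dataset does not state one.

```r
# Sample timestamps copied from the str() output above
stamps <- c("2016-03-27 00:53:11", "2016-04-04 01:39:02", "2016-03-13 20:35:42")

# Parse to date-time; the time zone is an assumption (the data does not state one)
ts <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

visit_date <- as.Date(ts)                    # calendar date
visit_hour <- as.integer(format(ts, "%H"))   # hour of day, 0-23

print(data.frame(visit_date, visit_hour))
```

Working with the hour as an integer makes it easy to group visits by time of day later on.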

Checking for outliers

boxplot(time_on_site)$out

numeric(0)

No outliers in amount of time spent on the site

boxplot(age)$out

numeric(0)
# outlier(advertising$Age)

No outliers in the ages of the users

boxplot(area_income)$out

[1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
# outlier(advertising$`Area Income`)

There are a few outliers in area income, all in the lower-income areas. Removing them may cause a loss of valuable information, so they will not be removed.
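
The values returned in $out are the points beyond the boxplot whiskers. A minimal sketch of that rule follows, assuming the default whisker length of 1.5 times the hinge spread; flag_outliers is a hypothetical helper name and the vector x is made up for illustration.

```r
# Flag values outside the boxplot whiskers (hinges +/- coef * hinge spread)
flag_outliers <- function(x, coef = 1.5) {
  h <- fivenum(x)            # min, lower hinge, median, upper hinge, max
  spread <- h[4] - h[2]
  x[x < h[2] - coef * spread | x > h[4] + coef * spread]
}

# Made-up values for illustration only (not taken from the dataset)
x <- c(10, 12, 11, 13, 12, 11, 14, 13, 12, 50)

flag_outliers(x)
boxplot.stats(x)$out   # base R's own calculation, for comparison
```

A larger coef makes the rule more tolerant, which is one way to check how sensitive the flagged set is to the threshold.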

boxplot(internet_usage)$out

numeric(0)
# outlier(advertising$Age)

There were no outliers in the internet usage time

boxplot(time)

There were no outliers in the time users were accessing or leaving the site

boxplot(date)

There were no outliers in the dates users were accessing or leaving the site

EDA

Univariate Analysis

get.mode <- function(v){
  uniq <- unique(v)
  # gets all the unique values in the column
  # match (v, uniq) matches a value to the unique values and returns the index
  # tabulate (match (v, uniq)) takes the values in uniq and counts the number of times each integer occurs in it.
  # which.max() gets the index of the first maximum in the tabulated list
  # then prints out the uniq value
  uniq[ which.max (tabulate (match (v, uniq)))]
}
mean(date); median(date); get.mode(date)
[1] "2016-04-09"
[1] "2016-04-07"
[1] "2016-04-04"

The mean (9 April) and median (7 April) dates are close together, so access to the site was fairly balanced over the period. The single busiest day was 4 April 2016; the mode of the dates alone does not show whether April was the busiest month overall.
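
One detail of get.mode worth keeping in mind is how it behaves on ties. The sketch below repeats the function without its comments and runs it on a made-up vector in which two values are equally frequent.

```r
get.mode <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# Made-up vector in which 3 and 1 both occur twice
v <- c(3, 1, 3, 2, 1)

get.mode(v)   # returns the first of the tied values, in order of appearance
table(v)      # full counts, which make the tie visible
```

For columns with many distinct values, such as dates or continuous measurements, the reported mode may therefore be one of several equally common values.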

max(date); min(date)
[1] "2016-07-24"
[1] "2016-01-01"

Dates when users were accessing or leaving the site ranged from January 1, 2016 to July 24, 2016
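
To compare activity between months directly, the dates can be counted per calendar month. A small sketch, using four dates copied from the str() output above rather than the full column; per_month is an illustrative name.

```r
# Sample dates copied from the str() output above
d <- as.Date(c("2016-03-27", "2016-04-04", "2016-03-13", "2016-01-10"))

# Count visits per calendar month (year-month keys avoid locale-dependent names)
per_month <- table(format(d, "%Y-%m"))

per_month
names(per_month)[which.max(per_month)]   # busiest month in this small sample
```

Applied to the full date vector, the same two lines would show which month had the most activity.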

The countries with the most consumers

table.country <- table(country) # creates a frequency table
view(table.country)#viewing the table
table.country <- table.country[order(-table.country)] # re-ordering the table
head(table.country,10) # previewing the ordered table
country
Czech Republic         France    Afghanistan      Australia         Cyprus 
             9              9              8              8              8 
        Greece        Liberia     Micronesia           Peru        Senegal 
             8              8              8              8              8 

The countries with the least consumers

tail(table.country)
country
     Marshall Islands            Montserrat            Mozambique 
                    1                     1                     1 
              Romania Saint Kitts and Nevis              Slovenia 
                    1                     1                     1 

Countries where there were the most ad clicks

only.ad <- country[ad==1]
table.country.ad <- table(only.ad)
view(table.country.ad)
table.country.ad <- table.country.ad[order(-table.country.ad)]
head(table.country.ad,10)
only.ad
    Australia      Ethiopia        Turkey       Liberia Liechtenstein 
            7             7             7             6             6 
 South Africa   Afghanistan        France       Hungary       Mayotte 
            6             5             5             5             5 

Cities with the most activity on the site

table.city <- table(city)
view(table.city)
table.city <- table.city[order(-table.city)]
head(table.city,10)
city
      Lisamouth    Williamsport Benjaminchester       East John    East Timothy 
              3               3               2               2               2 
       Johnstad        Joneston      Lake David      Lake James       Lake Jose 
              2               2               2               2               2 
only.ad <- city[ad==1]
table.city.ad <- table(only.ad)
view(table.city.ad)
table.city.ad <- table.city.ad[order(-table.city.ad)]
head(table.city.ad,10)
only.ad
  Lake David   Lake James    Lisamouth Michelleside   Millerbury   Robertfurt 
           2            2            2            2            2            2 
  South Lisa  West Amanda West Shannon Williamsport 
           2            2            2            2 
mean(age); median(age); get.mode(age)
[1] 36.009
[1] 35
[1] 31

The most common age was 31 years, with a median of 35 and a mean of 36. Since the mean is above the median and the mode, the distribution is skewed to the right.

max(age); min(age)
[1] 61
[1] 19

Ages ranged from 19 to 61 years

quantile(age,probs=c(0.05,0.95))
   5%   95% 
23.95 52.00 

About 90% of the consumers were between roughly 24 and 52 years old

var(age); sd(age)
[1] 77.18611
[1] 8.785562

The standard deviation of age is about 8.8 years, a fairly small spread around the mean of 36

ggplot(advertising,aes(age))+ geom_density()

The ages are skewed to the right: a lot of consumers are on the younger side, with a tail of older consumers
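
The direction of skew can also be checked numerically instead of being read off the plot. Below is a minimal sketch of the moment-based sample skewness; the helper name skewness and both vectors are made up for illustration, and base R has no built-in function for this.

```r
# Moment-based sample skewness: positive = right-skewed, negative = left-skewed
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_tail <- c(1, 2, 2, 3, 10)   # made-up values with a long right tail
left_tail  <- -right_tail         # mirror image, long left tail

skewness(right_tail)   # positive
skewness(left_tail)    # negative
```

Running skewness(age) on the notebook's age vector would give a sign that can be compared with the comparison of mean and median.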

mean(time_on_site); median(time_on_site); get.mode(time_on_site)
[1] 65.0002
[1] 68.215
[1] 62.26

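Average time on site was 65 minutes and the median was 68.2 minutes. The most frequent single value was 62.26 minutes, although the mode is not very informative for a continuous variable.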

max(time_on_site); min(time_on_site)
[1] 91.43
[1] 32.6

Time on the site ranged from 32.6 to 91.4 minutes

quantile(time_on_site,probs=c(0.05,0.95))
     5%     95% 
37.5765 86.1995 

About 90% of consumers spent between 37.6 and 86.2 minutes on the site.

var(time_on_site); sd(time_on_site)
[1] 251.3371
[1] 15.85361

There is some deviation (about 15.9 minutes) in time on site from one consumer to the next, but given that it depends entirely on preference it could be seen as a small deviation.

ggplot(advertising,aes(time_on_site))+ geom_histogram(fill='#222222')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Time on site has two peaks, around 40 and 80 minutes, showing two classes of consumers: those who spend a longer time on the site and those who spend less.

mean(time); median(time); get.mode(time)
[1] 12:09:09
[1] 12:05:51
[1] 17:39:06

The average time at which consumers accessed or left the site was around noon, and the most frequent single time was 17:39:06 (5:39 pm)

max(time); min(time)
[1] 23:59:06
[1] 00:00:48

Access times ranged across the whole day (24 hours)

quantile(time,probs=c(0.05,0.95))
      5%      95% 
01:13:50 22:49:45 

About 90% of accesses happened between 01:13 and 22:49

mean(area_income); median(area_income); get.mode(area_income)
[1] 55000
[1] 57012.3
[1] 61833.9

The average area income was 55,000 and the median was 57,012.3, while the most frequent single value was 61,833.9. Since the mean is lower than the median, the distribution is pulled down by a tail of low-income areas (left skew).

max(area_income);min(area_income)
[1] 79484.8
[1] 13996.5

The areas of income ranged from 13996.5 to 79484.8

quantile(area_income,probs=c(0.05,0.95))
      5%      95% 
28275.30 73600.72 

About 90% of consumers were in areas with incomes between 28275.30 and 73600.72

var(area_income); sd(area_income)
[1] 179952406
[1] 13414.63

There is some deviation (a standard deviation of 13414.63) in area income from one consumer to the next, which is expected for income data.

ggplot(advertising,aes(area_income))+ geom_density()

The density plot is skewed to the left: a lot more people were in the higher income brackets, with a long tail towards the low-income areas

ggplot(advertising,aes(area_income))+ geom_histogram(fill = "#222222", colour = "#038b8d")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mean(internet_usage); median(internet_usage); get.mode(internet_usage)
[1] 180.0001
[1] 183.13
[1] 167.22

The average internet usage was 180 minutes, while the most frequent single value was 167.22 minutes

max(internet_usage); min(internet_usage)
[1] 269.96
[1] 104.78

The range of the time spent was from 104.78 to 269.96 minutes

quantile(internet_usage,probs=c(0.05,0.95))
      5%      95% 
113.5095 246.7345 

About 90% of consumers spent between 113.5 and 246.7 minutes on the internet

var(internet_usage); sd(internet_usage)
[1] 1927.415
[1] 43.90234

The standard deviation of internet usage is about 43.9 minutes, a moderate spread around the mean of 180 minutes

ggplot(advertising,aes(internet_usage))+ geom_density()

There were two peaks, at around 125 and 225 minutes on the internet, showing two brackets of people spending different amounts of time on the internet

ggplot(advertising,aes(internet_usage))+ geom_histogram(fill = "#222222", colour = "#038b8d")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(advertising,aes(gender))+ geom_bar()

Slightly more females than males accessed the site (about 52% versus 48%)

ggplot(advertising,aes(ad))+ geom_bar(fill='#222222')

The dataset is balanced: an equal number of consumers clicked on an ad and did not click on one
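
The balance shown in the bar chart can be confirmed with a frequency table. A minimal sketch with made-up labels; in the notebook the vector would be ad.

```r
# Made-up 0/1 labels for illustration; in the notebook this would be `ad`
clicked <- c("0", "1", "1", "0", "1", "0")

counts <- table(clicked)
shares <- prop.table(counts)   # proportions that sum to 1

counts
shares
```

A balanced target matters later on, because accuracy is easier to interpret when both classes are equally common.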

Bivariate Analysis

# library('viridis')
ggplot(advertising,aes(gender,fill=ad))+ geom_bar()

Most females accessing the site had clicked an ad while most males visiting the site had not clicked an ad
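
The stacked bars can be backed up with a cross-tabulation of gender against clicks. The sketch below uses made-up data, and the names counts and row_share are illustrative.

```r
# Made-up data for illustration only
gender  <- c("0", "0", "0", "0", "1", "1", "1", "1")
clicked <- c("1", "1", "1", "0", "0", "0", "1", "0")

counts <- table(gender, clicked)

# margin = 1 normalises each row, giving the click share within each gender
row_share <- prop.table(counts, margin = 1)
row_share
```

Row shares are easier to compare than raw counts when the two genders are not equally represented.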

ggplot(advertising,aes(internet_usage,time_on_site))+ geom_point(alpha=0.5)+
  geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x

People who spent more time on the internet tended to stay longer on the site

ggplot(advertising,aes(internet_usage,time_on_site,color=ad))+ geom_point(alpha=0.75)+
  geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x

Most people who clicked on an ad spent less time on the site and on the internet than those who did not click an ad. However, considering the groups individually, consumers spent less time on the site the longer they spent on the internet
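
A trend that is positive overall can reverse inside each group. The sketch below illustrates this with made-up numbers (not the notebook's values) by comparing the overall correlation with the per-group correlations.

```r
# Made-up data for illustration only (not the notebook's values)
df <- data.frame(
  ad    = rep(c("0", "1"), each = 4),
  usage = c(200, 220, 240, 260, 110, 130, 150, 170),
  time  = c(85, 82, 79, 76, 50, 47, 44, 41)
)

overall  <- cor(df$usage, df$time)
by_group <- sapply(split(df, df$ad), function(g) cor(g$usage, g$time))

overall    # positive across all rows
by_group   # negative within each group
```

This is why the coloured plot is worth reading alongside the pooled one: the grouping variable can explain most of the pooled trend.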

ggplot(advertising,aes(age,internet_usage))+ geom_point(alpha=0.5)+
  geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x

There was a decline in internet usage as consumers got older.

ggplot(advertising,aes(age,internet_usage,color=ad))+ geom_point(alpha=0.75)+
  geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x

Internet usage among those who exited the site increased with age. Internet usage among those who clicked was fairly constant, with a slight decline with age; moreover, most of them were 35 years and above (an older generation)

ggplot(advertising,aes(age,time_on_site))+ geom_point(alpha=0.5)+
  geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x

Time on site went down the older the consumer got. Content may be geared towards a younger demographic.

ggplot(advertising,aes(age,time_on_site,color=ad))+ geom_point(alpha=0.75)+
  geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x

Time on site for consumers leaving the site increased with age; the content could be more relevant to consumers around 30 years, or they are loyal to the site. For those who clicked the ad, time on site was fairly constant (around 52 minutes) across ages.

ggplot(advertising,aes(area_income, fill=ad,color='black'))+ geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most of the people clicking on the ads were in areas with incomes between 40000 and 60000. From 50000 and above, people leaving the site outnumbered those coming into the site through ads
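
The same comparison can be put into numbers by cutting income into bands and taking the share of clicks in each band. A minimal sketch with made-up data; the band width of 20,000 is an arbitrary choice for illustration.

```r
# Made-up data for illustration only
income  <- c(25000, 42000, 48000, 55000, 58000, 63000, 70000, 75000)
clicked <- c(1, 1, 1, 1, 0, 0, 0, 0)

# Bands of 20,000; right = FALSE makes each band include its lower bound
band <- cut(income, breaks = seq(20000, 80000, by = 20000),
            right = FALSE, dig.lab = 6)

# The mean of a 0/1 variable is the share of clicks in each band
click_rate <- tapply(clicked, band, mean)
click_rate
```

Narrower bands give more detail but fewer observations per band, so the rates become noisier.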

ggplot(advertising,aes(age,area_income))+ geom_point(alpha=0.5)+
  geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x

Areas of income decreased as age increased

ggplot(advertising,aes(area_income,age,color=ad))+ geom_point(alpha=0.75)+
  geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x

The area income of those leaving the site increased with age, while that of those who clicked on ads decreased slightly with age. The average age of those clicking the ad was 40

ggplot(advertising,aes(time_on_site,area_income))+ geom_point(alpha=0.5)+
  geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x

Time on site increased with area income.

ggplot(advertising,aes(time_on_site,area_income,color=ad))+ geom_point(alpha=0.75)+
  geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x

Area income was fairly constant for those clicking ads, so it may not have much of an impact on their time on site. However, time on site increases as area income decreases, meaning it could have an impact on retention of consumers

advert <- subset(advertising, select = c(1:4,10,12))
heatmap(cor(advert),Rowv = NA, Colv = NA,scale = "column", margins = c(10,10))
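
Before reading the heatmap it can help to look at the correlation matrix itself. The sketch below uses a small made-up data frame, with x, y and z as placeholder column names.

```r
# Made-up numeric columns for illustration only
num <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 4, 5),
  z = c(5, 3, 4, 2, 1)
)

corr <- cor(num)   # Pearson correlation between every pair of columns
round(corr, 2)
```

If the matrix is then passed to heatmap(), scale = "none" keeps the colours on the raw correlation scale, whereas scale = "column" centres and scales each column first.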